Why do These Match? Explaining the Behavior of Image Similarity Models
Explaining a deep learning model can help users understand its behavior and
allow researchers to discern its shortcomings. Recent work has primarily
focused on explaining models for tasks like image classification or visual
question answering. In this paper, we introduce Salient Attributes for Network
Explanation (SANE) to explain image similarity models, where a model's output
is a score measuring the similarity of two inputs rather than a classification
score. In this task, an explanation depends on both of the input images, so
standard methods do not apply. Our SANE explanations pair a saliency map
identifying important image regions with an attribute that best explains the
match. We find that our explanations provide additional information not
typically captured by saliency maps alone, and can also improve performance on
the classic task of attribute recognition. Our approach's ability to generalize
is demonstrated on two datasets from diverse domains, Polyvore Outfits and
Animals with Attributes 2. Code available at:
https://github.com/VisionLearningGroup/SANE
Comment: Accepted at ECCV 2020
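To make the idea concrete, the following is a minimal sketch (not the authors' released code) of how a SANE-style explanation could pair a saliency map with a best-matching attribute: candidate attribute activation maps are ranked by how strongly they overlap the salient regions. The attribute names and the cosine-overlap scoring rule are illustrative assumptions.

```python
import numpy as np

def rank_attributes(saliency_map, attribute_maps):
    """Rank attribute names by how well their activation maps overlap the saliency map.

    saliency_map:   (H, W) array of non-negative importance scores for the match.
    attribute_maps: dict mapping attribute name -> (H, W) activation map.
    """
    s = saliency_map.ravel()
    s = s / (np.linalg.norm(s) + 1e-8)
    scores = {}
    for name, amap in attribute_maps.items():
        a = amap.ravel()
        a = a / (np.linalg.norm(a) + 1e-8)
        scores[name] = float(s @ a)  # cosine similarity between the two maps
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: the attribute whose activations concentrate on the same region as the
# saliency map is ranked first and would be surfaced alongside the map as the explanation.
saliency = np.zeros((8, 8)); saliency[2:5, 2:5] = 1.0
maps = {"striped": np.pad(np.ones((3, 3)), ((2, 3), (2, 3))),
        "floral":  np.pad(np.ones((3, 3)), ((5, 0), (5, 0)))}
print(rank_attributes(saliency, maps))  # ['striped', 'floral']
```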
Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
We provide a new multi-task benchmark for evaluating text-to-image models. We
perform a human evaluation comparing the most common open-source (Stable
Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI
graduate students evaluated the two models on three tasks, at three difficulty
levels, across ten prompts each, providing 3,600 ratings. Text-to-image
generation has seen rapid progress to the point that many recent models have
demonstrated their ability to create realistic high-resolution images for
various prompts. However, current text-to-image methods and the broader body of
research in vision-language understanding still struggle with intricate text
prompts that contain many objects with multiple attributes and relationships.
We introduce a new text-to-image benchmark that contains a suite of thirty-two
tasks over multiple applications that capture a model's ability to handle
different features of a text prompt. For example, one task asks a model to generate a
varying number of the same object to measure its ability to count, while another provides
a text prompt with several objects that each have a different attribute to test its
ability to match objects and attributes correctly. Rather than
subjectively evaluating text-to-image results on a set of prompts, our new
multi-task benchmark consists of challenge tasks at three difficulty levels
(easy, medium, and hard) and human ratings for each generated image.
Comment: NeurIPS 2022 Workshop on Human Evaluation of Generative Models (HEGM)
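As a quick sanity check on the numbers quoted above (illustrative arithmetic, not code from the paper), the rating count follows directly from the study design: two models, three tasks, three difficulty levels, and ten prompts each yield 180 generated images, each rated by twenty graduate students.

```python
# Rating count implied by the study design described above (illustrative arithmetic only).
models, tasks, levels, prompts, raters = 2, 3, 3, 10, 20
images = models * tasks * levels * prompts   # 180 generated images per rater
ratings = images * raters                    # 3,600 human ratings in total
print(images, ratings)                       # 180 3600
```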
Beyond the Visual Analysis of Deep Model Saliency
Increased explainability in machine learning is traditionally associated with lower performance; e.g., a decision tree is more explainable but less accurate than a deep neural network. We argue that, in fact, increasing the explainability of a deep classifier can improve its generalization. In this chapter, we survey a line of our published work that demonstrates how spatial and spatiotemporal visual explainability can be obtained, and how such explainability can be used to train models that generalize better on unseen in-domain and out-of-domain samples, refine fine-grained classification predictions, better utilize network capacity, and are more robust to network compression.
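As a concrete illustration of how a spatial saliency map of the kind discussed above can be obtained, the sketch below computes a Grad-CAM-style map by weighting a convolutional feature map with the gradients of the top class score. The backbone (torchvision ResNet-18) and layer choice (layer4) are assumptions for the example, not the specific methods surveyed in the chapter.

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in classifier with random weights
features = {}

def hook(_, __, output):
    features["maps"] = output                  # keep the last convolutional feature maps

model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)    # stand-in input image
logits = model(x)
class_score = logits[0, logits.argmax()]                # score of the predicted class
grads = torch.autograd.grad(class_score, features["maps"])[0]

weights = grads.mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
cam = F.relu((weights * features["maps"]).sum(dim=1))   # weighted sum over channels
cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                    mode="bilinear", align_corners=False)
print(cam.shape)                                        # torch.Size([1, 1, 224, 224])
```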